How to search with locust

To get the best results from a search engine, it is important to understand the basic principles, strengths, and limitations of search engine technology.

Everything becomes known in comparison. Following this ancient wisdom, in order to better explain search engine technology, we will compare conventional human-assisted search with a computer-assisted search:

When you ask a librarian or a colleague to help you find documents, you describe what you are looking for in free-form language, using whatever words come to mind. A human understands concepts regardless of which particular words are used to describe them. However, human memory is unlikely to retain the exact words used in a particular document, and it can usually recall the details of only several thousand documents.

In contrast, a search engine does not understand anything about the meaning of documents, concepts, or ideas. It is a computer tool that knows only keywords and simple logical operations. The computer is an idiot savant, able to find all documents containing specified keywords from millions or billions of documents in its database in a fraction of a second, something that would take thousands of years for a human to achieve. However, a computer will not help you if you cannot express what you need in terms of the only concepts it is capable of understanding: keywords and logical operations.

Modern search engines find documents that contain one or more keywords from a set of keywords specified by the user. Depending on the particular configuration of a given search engine, keywords may be expanded by the computer using a process called stemming - using variant grammatical forms of a keyword in a search. The computer usually does this by attaching suffixes to the root of the keyword. Search engines can also apply other conditions, for example, the position of keywords in relation to each other as they appear in the document, the obligatory or optional occurrence of a keyword in a document, the exclusion of keywords, and the occurrence of keywords in a specified part of the document (the title, the body, the URL, or the meta information). The combination of keywords and conditions submitted to the search engine is called a query.
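The suffix-based expansion described above can be illustrated with a minimal sketch. The suffix list and the function names are assumptions made for this example; they are not taken from locust, and real stemmers use language-specific rules.

```python
# Hypothetical suffix list, for illustration only.
SUFFIXES = ("", "s", "ed", "ing")


def expand_keyword(root):
    """Return the root plus naive variants formed by attaching suffixes."""
    return {root + suffix for suffix in SUFFIXES}


def keyword_matches(document_words, root):
    """True if the document contains the root or any expanded variant."""
    return not expand_keyword(root).isdisjoint(document_words)
```

With this sketch, a search for "rank" would also match documents that contain only "ranked" or "ranking".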

Finding all the documents that match a query is the easy part. The problem is that usually too many documents are found. For example, a search for "European Union" through one of the installations of locust returns 105'738 documents. It would be impossible to read all or even a significant part of these. Therefore, the most important part of a search engine is its ranking algorithm, which ranks documents in an attempt to put those that are most likely to satisfy the user at the top of its result list.

General structure of locust

locust consists of four main parts - the web spider, the index database, the search daemon, and the front end.

The web spider, or crawler, is a background program that finds documents on the Internet, downloads their contents, preprocesses them for fast keyword search, and saves the preprocessed documents, and optionally the original documents, in the index database.

The index database is stored on disk. The database is optimized for keyword searches, allowing locust to find the documents that match a user query and rank them in a fraction of a second. Building the index, however, means downloading and preprocessing millions of documents from the Internet, which takes much longer. locust can index tens of millions of documents at a rate of about 3 million documents per 24 hours on a single Dell server.
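To put the quoted indexing rate in perspective, the arithmetic can be sketched as follows. The 30 million document collection used in the note below is a hypothetical size chosen for illustration.

```python
DOCS_PER_DAY = 3_000_000          # indexing rate quoted in the text
SECONDS_PER_DAY = 24 * 60 * 60


def docs_per_second(docs_per_day=DOCS_PER_DAY):
    """Average number of documents indexed per second."""
    return docs_per_day / SECONDS_PER_DAY


def days_to_index(total_docs, docs_per_day=DOCS_PER_DAY):
    """Days needed to index a collection at a constant rate."""
    return total_docs / docs_per_day
```

At this rate the spider handles roughly 35 documents per second, and a hypothetical collection of 30 million documents would take about 10 days to index.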

The search daemon gets queries from the front end, searches the index for the documents that match the query, ranks them, and returns the highest ranking documents to the front end for presentation to the user.

The front end is implemented as a CGI executable that presents the user with the search page, receives the query, passes it to the search daemon, and displays results returned by the search daemon.

Ranking Principles

The ranking in locust has two components: a keyword component, based on the number and position of keywords in a document, and a link popularity (page rank) component, reflecting how many other indexed documents link to it.

The keyword component gives preference to documents that:

Contain the most occurrences of the keywords
Have keywords close to each other
Have keywords close to the beginning of the document (configurable)
Have keywords in visually prominent positions (e.g. in headers or displayed in large fonts)

The link popularity component gives more weight to documents that have more links to them from other indexed documents.
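The way these two components might combine can be shown with a toy scoring function. This is only a sketch under stated assumptions: the formula, the LINK_WEIGHT constant, and the function names are invented for illustration and are not locust's actual ranking algorithm. Visual prominence is left out because it needs markup information.

```python
LINK_WEIGHT = 0.5  # hypothetical weight of one inbound link


def keyword_score(words, keywords):
    """Toy keyword component: occurrences, proximity, and early position."""
    positions = {k: [i for i, w in enumerate(words) if w == k] for k in keywords}
    if any(not p for p in positions.values()):
        return 0.0  # a keyword is missing, so the document does not match
    occurrences = sum(len(p) for p in positions.values())
    first = [p[0] for p in positions.values()]
    spread = max(first) - min(first)   # crude proximity of the keywords
    earliest = min(first)              # distance from the document start
    return occurrences + 1.0 / (1 + spread) + 1.0 / (1 + earliest)


def rank(docs, inbound_links, keywords):
    """Return matching document names, best first.

    docs maps a name to its list of words; inbound_links maps a name
    to the number of indexed documents linking to it.
    """
    def total(name):
        links = inbound_links.get(name, 0)
        return keyword_score(docs[name], keywords) + LINK_WEIGHT * links

    matching = [n for n in docs if keyword_score(docs[n], keywords) > 0]
    return sorted(matching, key=total, reverse=True)
```

In this sketch a document with many inbound links can overtake one with a better keyword score, which is the trade-off the two components are meant to express.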

Basic Search

A basic query consists of a sequence of one or more words typed into the search form. By default, locust finds the documents that contain all the words in the sequence, except for so-called stopwords - prepositions and other commonly used words (e.g. "have", "and", and "of"), which are generally not useful for document searches. For example, a query of four words, none of them stopwords, returns documents that contain all four words somewhere in the document body, title, or meta information.
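As a rough illustration of the default matching rule, the sketch below drops stopwords and then requires every remaining word to occur in the document. The stopword list and function names are assumptions for this example; locust's real list is not given here.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"have", "and", "of", "the", "a", "in"}


def query_terms(query):
    """Split a basic query into lowercase words and drop stopwords."""
    return [w for w in query.lower().split() if w not in STOPWORDS]


def basic_match(document_text, query):
    """True if the document contains every non-stopword of the query."""
    words = set(document_text.lower().split())
    return all(term in words for term in query_terms(query))
```

The order of the words does not matter in this kind of search; only their presence does.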

Another form of simple search is a phrase search, where the query consists of a sequence of words in double quotation marks. This query finds documents that contain all the words in the order in which they were entered, ignoring punctuation marks and stopwords. A phrase search finds a much more restricted set of documents than a search for the same words without quotation marks, and when an exact phrase is known, it can help to find documents that would otherwise be lost in a more general search.
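A phrase match can be sketched in a similar way. The example assumes that punctuation and stopwords are removed from both the document and the phrase before the word sequences are compared; this is one reading of the rule above, not a description of locust's internals, and the stopword list is hypothetical.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"have", "and", "of", "the"}


def tokens(text):
    """Lowercase words of the text, without punctuation or stopwords."""
    return [w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS]


def phrase_match(document_text, phrase):
    """True if the phrase words occur consecutively, in order, in the document."""
    doc, wanted = tokens(document_text), tokens(phrase)
    n = len(wanted)
    return n > 0 and any(doc[i:i + n] == wanted for i in range(len(doc) - n + 1))
```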

Words and phrases can be mixed in one query. The resulting documents will contain all the words and phrases present in the query.

Advanced Search

Advanced search features allow users to set conditions on the positions of keywords in documents relative to each other; choose whether the occurrence of keywords and their combinations in documents are obligatory or optional; exclude keywords; and choose where keywords should occur in documents (in the title, the body, the URL, or the meta information). Searches can also be restricted to documents contained in a subset of indexed sites.

The advanced search query syntax allows for a combination of atomic terms - words, phrases, and patterns (patterns are described below).

Patterns

A pattern is a sequence of normal characters and the special matching characters "?" and "*". The character "?" can stand for any single character and the character "*" can stand for zero or more characters. At least three normal characters must be used at the beginning of a pattern. Documents containing at least one word that matches the pattern are included in the search results. Patterns must be used with care. A pattern that is short and matches too many words can overload the search engine. The search

finds documents containing the words terror, terrorism, terrorist, and terrorists.
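The matching rules for "?" and "*" can be sketched by translating a pattern into a regular expression. The function names are assumptions, and the pattern terror* used in the note below is an illustrative pattern chosen for this sketch.

```python
import re


def compile_pattern(pattern):
    """Translate a "?"/"*" pattern into a compiled regular expression."""
    if len(pattern) < 3 or any(c in "?*" for c in pattern[:3]):
        raise ValueError("a pattern must start with at least three normal characters")
    parts = []
    for c in pattern:
        if c == "?":
            parts.append(".")        # any single character
        elif c == "*":
            parts.append(".*")       # zero or more characters
        else:
            parts.append(re.escape(c))
    return re.compile("".join(parts))


def matching_words(pattern, words):
    """Return the words that match the whole pattern."""
    regex = compile_pattern(pattern)
    return [w for w in words if regex.fullmatch(w)]
```

Under these rules, terror* matches terror, terrorism, terrorist, and terrorists but not territory, and a pattern such as te* is rejected because it begins with fewer than three normal characters.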

Boolean Search

Boolean expressions are composed from atomic terms using the Boolean operators AND and OR (also written as '&' and '|') and the negation operator '-'. The AND operator between terms is assumed by default and can be omitted. A Boolean expression grouped in parentheses can be used as an atomic term inside another Boolean expression. For example, the query

finds documents about NATO and Warsaw Pact missiles, excluding documents containing the abbreviation CDI (Center for Defense Information).

Note that the operators AND and OR must be uppercase, and no space is allowed between the negation operator "-" and the word following it.
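How such an expression might be evaluated against a single document can be sketched with a small recursive parser. Several points are assumptions of this sketch rather than documented locust behaviour: AND binds more tightly than OR, only single words (not phrases or patterns) are supported as atomic terms, and malformed queries are not handled.

```python
import re


def tokenize(query):
    """Split a query into parentheses, operators, negation signs, and words."""
    return re.findall(r"[()&|]|-(?=[\w(])|[^\s()&|]+", query)


def matches(query, words):
    """Evaluate a Boolean query against the set of lowercase words of a document."""
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        result = parse_and()
        while peek() in ("OR", "|"):
            take()
            right = parse_and()
            result = result or right
        return result

    def parse_and():
        result = parse_term()
        while peek() not in (None, ")", "OR", "|"):
            if peek() in ("AND", "&"):   # explicit AND; otherwise it is implied
                take()
            right = parse_term()
            result = result and right
        return result

    def parse_term():
        token = take()
        if token == "-":                 # negation applies to the next term
            return not parse_term()
        if token == "(":
            result = parse_or()
            take()                       # consume the closing ")"
            return result
        return token.lower() in words

    return parse_or()
```

Because only uppercase AND and OR are treated as operators, a lowercase "and" in a query would be looked up as an ordinary word in this sketch.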

Restricted Search

Searches can be restricted to documents contained in a subset of indexed sites. All sites with the URL containing one of the specified strings will be included in the subset. For example, the search

finds documents containing the words "Kyoto Protocol" on all the indexed sites belonging to the .int domain.

The documents from a subset of sites can be excluded by placing the negation operator in front of the site keyword.
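Site restriction and exclusion amount to a substring filter on site names, which the following sketch illustrates. It assumes that the specified strings are compared against the host part of each URL; the function names and the example URLs in the note below are hypothetical, and the query syntax itself is not modelled.

```python
from urllib.parse import urlsplit


def site_of(url):
    """Host name of a URL, or an empty string if there is none."""
    return urlsplit(url).hostname or ""


def restrict(urls, include=(), exclude=()):
    """Keep URLs whose site contains any include string and no exclude string."""
    kept = []
    for url in urls:
        site = site_of(url)
        if include and not any(s in site for s in include):
            continue
        if any(s in site for s in exclude):
            continue
        kept.append(url)
    return kept
```

For example, restrict(urls, include=[".int"]) would keep a hypothetical http://www.nato.int/docs/a.html and drop http://www.example.com/int.html. Since the comparison is a plain substring test, short strings can match more sites than intended.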

Finding Links

The following query will show all the documents in the index database that have links to (www.isn.ethz.ch).
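A link search of this kind can be answered from a reverse link map built while documents are indexed. The sketch below is an assumption about how such a lookup could work, not a description of locust's index format; the function names are hypothetical.

```python
from collections import defaultdict


def build_backlinks(outgoing):
    """Invert a map of document -> linked targets into target -> linking documents."""
    backlinks = defaultdict(set)
    for source, targets in outgoing.items():
        for target in targets:
            backlinks[target].add(source)
    return backlinks


def linking_documents(backlinks, target):
    """Sorted list of indexed documents that link to the target."""
    return sorted(backlinks.get(target, ()))
```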

Known Issues

locust indexes PDF files. However, some PDF files consist of scanned images rather than text and cannot be indexed.

Some websites have "black holes": folders containing dynamically generated junk pages that expand into many millions of distinct URLs. For example, some innocent-looking tables of contents can be black holes. Black holes prevent locust (and other search engines) from indexing the real documents they refer to.